Clustering and Classification of Multi-domain Proteins
The rapid development of next-generation sequencing technology has led to unprecedented growth in protein sequence data repositories over the last decade. The majority of these proteins lack structural and functional characterization. This necessitates the design and development of fast, efficient, and sensitive computational tools and algorithms that can classify these proteins into functionally coherent groups.
Domains are fundamental units of protein structure and function. Multi-domain proteins are considerably more complex than proteins that have a single domain or none. They undergo complex, network-like evolutionary events such as domain shuffling, domain loss, and domain gain. These events therefore cannot be represented by conventional protein clustering approaches such as phylogenetic reconstruction and Markov clustering. In this thesis, a multi-domain protein classification system is developed based primarily on the domain composition of protein sequences. Using the principle of co-clustering (biclustering), proteins and domains are clustered simultaneously, so that each bicluster contains a subset of proteins and a subset of domains that together form a complete bipartite graph. These biclusters are then converted into a network based on the domains shared between them, thereby classifying the proteins into families of similar proteins.
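The idea of biclusters as complete bipartite graphs, linked into a network by shared domains, can be illustrated with a minimal Python sketch. This is not the algorithm developed in the thesis: it assumes the simplest possible biclustering rule (proteins with an identical domain set form one bicluster), and the protein names, domain names, and function names are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def biclusters_by_composition(protein_domains):
    """Group proteins that have identical domain sets.

    Each group, paired with its domain set, is a complete bipartite graph:
    every protein in the group contains every domain in the set.
    (Assumed simplification, not the thesis's biclustering method.)
    """
    groups = defaultdict(set)
    for protein, domains in protein_domains.items():
        groups[frozenset(domains)].add(protein)
    # Sort by member names so the output order is deterministic.
    return sorted(
        ((frozenset(proteins), domains) for domains, proteins in groups.items()),
        key=lambda bicluster: sorted(bicluster[0]),
    )


def bicluster_network(biclusters):
    """Connect two biclusters whenever their domain sets overlap."""
    edges = []
    for (i, (_, dom_i)), (j, (_, dom_j)) in combinations(enumerate(biclusters), 2):
        shared = dom_i & dom_j
        if shared:
            edges.append((i, j, shared))
    return edges


# Hypothetical toy data: protein -> set of domains it contains.
toy = {
    "P1": {"A", "B"},
    "P2": {"A", "B"},
    "P3": {"B", "C"},
    "P4": {"D"},
}

biclusters = biclusters_by_composition(toy)
edges = bicluster_network(biclusters)

for proteins, domains in biclusters:
    print(sorted(proteins), sorted(domains))
for i, j, shared in edges:
    print(f"bicluster {i} -- bicluster {j} via {sorted(shared)}")
```

In this toy input, P1 and P2 form one bicluster that is linked to the bicluster of P3 through the shared domain B, while P4 remains unconnected.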
We applied our biclustering network approach to a multi-domain protein family, the Regulator of G-protein Signalling (RGS) proteins, whose subfamilies have heterogeneous domain compositions. Our approach produced clusters that were mostly consistent with the existing RGS subfamilies. The average maximum Jaccard Index scores for the clusters obtained by Markov clustering and phylogenetic clustering against the biclusters were 0.64 and 0.60, respectively. Unlike these other clustering methods, our approach uses auxiliary domain information for each protein, and it therefore generates more functionally coherent protein clusters and distinguishes the protein subfamilies from one another. Biclustered networks built on nine complete proteomes showed that the proportion of multi-domain proteins included in connected biclusters increased rapidly with genome complexity, from 48.5% in bacteria to 80% in eukaryotes.
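One plausible reading of the "average maximum Jaccard Index" is sketched below in Python: each cluster from the comparison method is matched to its best-overlapping bicluster, and the best scores are averaged. The direction of the comparison, the function names, and the toy clusters are assumptions made for illustration; the abstract does not spell out the exact procedure.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections of protein IDs."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_max_jaccard(clusters, reference):
    """Average, over `clusters`, of the best Jaccard score against `reference`."""
    if not clusters or not reference:
        raise ValueError("both clusterings must be non-empty")
    best_scores = [max(jaccard(c, r) for r in reference) for c in clusters]
    return sum(best_scores) / len(best_scores)


# Hypothetical toy clusterings (not data from the thesis).
other_method = [{"P1", "P2", "P3"}, {"P4"}]
toy_biclusters = [{"P1", "P2"}, {"P3"}, {"P4"}]

score = average_max_jaccard(other_method, toy_biclusters)
print(round(score, 4))
```

Here the first cluster best matches {P1, P2} with a score of 2/3 and the second matches {P4} exactly, so the average is 5/6.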
Protein clustering and classification that incorporates this wealth of additional domain information into protein networks has wide applications and would benefit the functional analysis and characterization of novel proteins.
Advisers: Stephen D. Scott and Etsuko N. Moriyama
Limited mitogenomic degradation in response to a parasitic lifestyle in Orobanchaceae
In parasitic plants, the reduction in plastid genome (plastome) size and content is driven predominantly by the loss of photosynthetic genes. The first completed mitochondrial genomes (mitogenomes) from parasitic mistletoes also exhibit significant degradation, but the generality of this observation for other parasitic plants is unclear. We sequenced the complete mitogenome and plastome of the hemiparasite Castilleja paramensis (Orobanchaceae) and compared them with those of additional holoparasitic, hemiparasitic and nonparasitic species from Orobanchaceae. Comparative mitogenomic analysis revealed minimal gene loss among the seven Orobanchaceae species, indicating that typical mitochondrial function has been retained in these species. Phylogenetic analysis demonstrated that the mobile cox1 intron was acquired vertically from a nonparasitic ancestor, arguing against a role for Orobanchaceae parasites in the horizontal acquisition or distribution of this intron. The C. paramensis plastome has retained nearly all genes except for the recent pseudogenization of four subunits of the NAD(P)H dehydrogenase complex, indicating a very early stage of plastome degradation. These results lend support to the notion that loss of ndh gene function is the first step of plastome degradation in the transition to a parasitic lifestyle.
Evolution of a Large, Conserved, and Syntenic Gene Family in Insects
The Osiris gene family, first described in Drosophila melanogaster, is clustered in the genomes of all Drosophila species sequenced to date. In D. melanogaster, it explains the enigmatic phenomenon of the triplo-lethal and haploinsufficient locus Tpl. The synteny of Osiris genes in flies is well conserved, and it is one of the largest syntenic blocks in the Drosophila group. By examining the genome sequences of other insects in a wide range of taxonomic orders, we show here that the gene family is well conserved and syntenic not only in the Diptera but across the holometabolous and hemimetabolous insects. Osiris gene homologs have also been found in the expressed sequence tag sequences of various other insects but are absent from all groups that are not insects, including crustaceans and arachnids. It is clear that the gene family evolved by gene duplication and neofunctionalization very soon after the insects diverged from other arthropods but before the insects diverged from one another, and that the sequences and synteny have been maintained by selection ever since.
ClinGen Pathogenicity Calculator: a configurable system for assessing pathogenicity of genetic variants
Background: The success of the clinical use of sequencing-based tests (from single gene to genomes) depends on the accuracy and consistency of variant interpretation. Aiming to improve the interpretation process through practice guidelines, the American College of Medical Genetics and Genomics (ACMG) and the Association for Molecular Pathology (AMP) have published standards and guidelines for the interpretation of sequence variants. However, manual application of the guidelines is tedious and prone to human error. Web-based tools and software systems may not only address this problem but also document reasoning and supporting evidence, thus enabling transparency of evidence-based reasoning and resolution of discordant interpretations.
Results: In this report, we describe the design, implementation, and initial testing of the Clinical Genome Resource (ClinGen) Pathogenicity Calculator, a configurable system and web service for the assessment of pathogenicity of Mendelian germline sequence variants. The system allows users to enter the applicable ACMG/AMP-style evidence tags for a specific allele with links to supporting data for each tag and generate guideline-based pathogenicity assessment for the allele. Through automation and comprehensive documentation of evidence codes, the system facilitates more accurate application of the ACMG/AMP guidelines, improves standardization in variant classification, and facilitates collaborative resolution of discordances. The rules of reasoning are configurable with gene-specific or disease-specific guideline variations (e.g. cardiomyopathy-specific frequency thresholds and functional assays). The software is modular, equipped with robust application program interfaces (APIs), and available under a free open source license and as a cloud-hosted web service, thus facilitating both stand-alone use and integration with existing variant curation and interpretation systems. The Pathogenicity Calculator is accessible at http://calculator.clinicalgenome.org.
Conclusions: By enabling evidence-based reasoning about the pathogenicity of genetic variants and by documenting supporting evidence, the Calculator contributes toward the creation of a knowledge commons and more accurate interpretation of sequence variants in research and clinical care.
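The notion of configurable, rule-based reasoning over evidence tags can be sketched in Python as follows. This is not the Calculator's API or its rule set: the rule tables are a deliberately small, simplified subset written in the style of ACMG/AMP combining rules, and every function and variable name here is hypothetical. Passing different rule tables to `classify` stands in for gene-specific or disease-specific configuration.

```python
from collections import Counter

# Tag prefixes in the ACMG/AMP naming style (e.g. "PVS1", "PM2", "BP4").
PREFIXES = {"PVS", "PS", "PM", "PP", "BA", "BS", "BP"}

# Illustrative, incomplete rule tables (assumption: NOT the full guideline).
# Each rule is (assessment, predicate over per-prefix tag counts).
PATHOGENIC_RULES = [
    ("Pathogenic", lambda c: c["PVS"] >= 1 and c["PS"] >= 1),
    ("Pathogenic", lambda c: c["PS"] >= 2),
    ("Likely pathogenic", lambda c: c["PS"] == 1 and 1 <= c["PM"] <= 2),
]
BENIGN_RULES = [
    ("Benign", lambda c: c["BA"] >= 1),
    ("Benign", lambda c: c["BS"] >= 2),
    ("Likely benign", lambda c: c["BP"] >= 2),
]


def count_strengths(tags):
    """Count evidence tags per prefix, e.g. ["PS1", "PM2"] -> {"PS": 1, "PM": 1}."""
    counts = Counter()
    for tag in tags:
        prefix = tag.rstrip("0123456789")
        if prefix not in PREFIXES:
            raise ValueError(f"unrecognized evidence tag: {tag}")
        counts[prefix] += 1
    return counts


def first_match(rules, counts):
    """Return the assessment of the first rule that fires, or None."""
    return next((label for label, test in rules if test(counts)), None)


def classify(tags, pathogenic_rules=PATHOGENIC_RULES, benign_rules=BENIGN_RULES):
    """Combine evidence tags into an assessment using swappable rule tables."""
    counts = count_strengths(tags)
    pathogenic = first_match(pathogenic_rules, counts)
    benign = first_match(benign_rules, counts)
    if pathogenic and benign:  # contradictory evidence
        return "Uncertain significance"
    return pathogenic or benign or "Uncertain significance"


results = {
    "PVS1+PS1": classify(["PVS1", "PS1"]),
    "PS1+PM2": classify(["PS1", "PM2"]),
    "BP4+BP7": classify(["BP4", "BP7"]),
    "PS1+PS2+BA1": classify(["PS1", "PS2", "BA1"]),
}
for name, assessment in results.items():
    print(name, "->", assessment)
```

Keeping the rules as data rather than hard-coded branches is what makes this kind of reasoning configurable: a stricter or disease-specific rule table can be supplied without touching the classification logic.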
Reductive evolution and the loss of PDC/PAS domains from the genus Staphylococcus
Background: The Per-Arnt-Sim (PAS) domain represents a ubiquitous structural fold that is involved in bacterial sensing and adaptation systems, including several virulence-related functions. Although PAS domains and the subclass of PhoQ-DcuS-CitA (PDC) domains have a common structure, there is limited amino acid sequence similarity. To gain greater insight into the evolution of PDC/PAS domains present in the bacterial kingdom, and in staphylococci in particular, the PDC/PAS domains from the genomic sequences of 48 bacteria, representing 5 phyla, were identified using a sensitive search method based on HMM-to-HMM comparisons (HHblits).
Results: A total of 1,007 PAS domains and 686 PDC domains distributed over 1,174 proteins were identified. For 28 Gram-positive bacteria, the distribution, organization, and molecular evolution of PDC/PAS domains were analyzed in greater detail, with a special emphasis on the genus Staphylococcus. Compared with other bacteria, the staphylococci have relatively few proteins (6–9) containing PDC/PAS domains. As a general rule, the staphylococcal genomes examined in this study contain a core group of seven PDC/PAS domain-containing proteins consisting of WalK, SrrB, PhoR, ArlS, HssS, NreB, and GdpP. The exceptions to this rule are: 1) S. saprophyticus lacks the core NreB protein; 2) S. carnosus has two additional PAS domain-containing proteins; 3) S. epidermidis, S. aureus, and S. pseudintermedius have an additional protein with two PDC domains that is predicted to code for a sensor histidine kinase; 4) S. lugdunensis has an additional PDC-containing protein predicted to be a sensor histidine kinase.
Conclusions: This comprehensive analysis demonstrates that variation in PDC/PAS domains among bacteria has limited correlation with genome size or pathogenicity; however, our analysis established that bacteria having a motile phase in their life cycle have significantly more PDC/PAS-containing proteins. In addition, our analysis revealed a tremendous amount of variation in the number of PDC/PAS-containing proteins within genera. This variation extended to the Staphylococcus genus, which had between 6 and 9 PDC/PAS proteins; some of these appear to be previously undescribed signaling proteins. This latter point is important because most staphylococcal proteins that contain PDC/PAS domains regulate virulence factor synthesis or antibiotic resistance.